fix: M1/M2 backward compatibility — MIL syntax + fp16 I/O fallback (rebased) #8
Draft
codegen-sh[bot] wants to merge 3 commits into main from
Conversation
Port upstream PR #6 (imperatormk) — fixes MIL scalar type syntax from the M4-only shorthand to the canonical verbose format that compiles on all Apple Silicon (M1/M2/M3/M4).

Changes:
- `program(1.3)` → `program(1.0)`, `ios18` → `ios16` target
- Scalar type shorthand → canonical verbose format
- Simplified buildInfo dict (no M4-specific version strings)
- fp16 I/O fallback: `g_fp16_io` flag with auto-retry on compile failure for M1/M2, where the `cast` op is unsupported
- Dynamic IOSurface byte calculation (bpe: 2 for fp16, 4 for fp32)

Tested on M1 Pro, macOS 26.3 (per the upstream PR author).
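For reference, the shape of the syntax change is roughly the following. This is an illustrative sketch of MIL text form, not copied from the diff; the exact target annotations and const spelling here are assumptions:

```
// M4-only shorthand (rejected by older toolchains):
program(1.3)
func main<ios18>(...) {
    const axis = int32(0);
}

// Canonical verbose form (compiles on M1/M2/M3/M4):
program(1.0)
func main<ios16>(...) {
    const axis = tensor<int32, []>(0);   // rank-0 tensor instead of bare scalar
}
```

The key idea is that bare `string`/`bool`/`int32` scalars become rank-0 `tensor<T, []>` values, and the program/target versions are lowered so the graph is acceptable to every Apple Silicon generation.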
train.m includes ane_mil_gen.h (via backward.h → model.h), which declares `extern int g_fp16_io`, but train.m never defined it — producing an undefined-symbol linker error.

Changes:
- train.m: define `int g_fp16_io = 0` at file scope; wrap `model_compile_kernels` with auto-retry (try fp32; on failure set `g_fp16_io = 1` and retry with fp16)
- model.h: `compile_conv_kernel` IOSurface byte calculation now uses `g_fp16_io ? 2 : 4` (was hardcoded to 4)
- .gitignore: add the train binary plus test/probe binaries
Backports from imperatormk/ane-train:
1. Disk compile cache (ane_set_cache_dir / ane_enable_cache)
Persists compiled kernels to ~/.cache/ane_compile/ — saves
100-500ms per kernel on subsequent runs.
2. ane_rewire() — zero-copy IOSurface pointer swap for kernel chaining.
Enables activation chaining, gradient routing, weight ping-pong
without CPU roundtrips.
3. Non-nil weights dict fix — passing nil wdict to modelWithMILText:
silently returns nil. Now passes @{} for weight-free kernels.
4. ANE_TRAINING.md — comprehensive constraint cheatsheet covering
tensor layout, IOSurface slot ordering, broadcast rules, the sqrt
bug, variable naming pitfalls, and proven training patterns.
All findings from direct M1/M1 Pro probing.
Rebased version of PR #3, now current with all 27 upstream commits on main.

What this does

Makes the ANE training pipeline run on M1/M2 chips (whose ANE lacks the `cast` MIL op) by:

- MIL syntax fixes — downgraded from `program(1.3)`/`ios18` to `program(1.0)`/`ios16`; replaced bare `string`, `bool`, `int32` scalars with the `tensor<T, []>` rank-0 form; swapped `string(...)` for `tensor<string, []>(...)` everywhere
- Dual I/O paths — the `g_fp16_io` flag selects between: fp16 with no `cast` ops (the ANE accepts the graph), or fp32 with `cast` to/from fp16 internally
- Auto-retry — `model_compile_kernels`/`bench()` tries fp32 first; on compile failure it flips `g_fp16_io = 1` and retries with fp16, else falls back to CPU
- IOSurface byte sizing — `g_fp16_io ? 2 : 4` bytes per element for buffer allocation
- Checkpoint persistence — `CkptHeader` saves/restores `g_fp16_io` so training resumes with the correct I/O mode

Rebase conflict resolution

Integrated upstream's `bool` return type + error handling + atomic checkpoint writes with our M2 compatibility patches. 4 conflict regions in `tiny_train.m` were resolved manually, keeping the best of both.

Supersedes #3.
💻 View my work • 👤 Initiated by @dermitchell1993 • About Codegen